Who is the Real Hero? Measuring Developer Contribution via Multi-dimensional Data Integration
Proper incentives are important for motivating developers in open-source
communities, which is crucial for keeping the development of open-source
software healthy. To provide such incentives, an accurate and objective
developer contribution measurement method is needed. However, existing methods
rely heavily on manual peer review, lacking objectivity and transparency. Some
automated effort-estimation approaches use metrics based only on syntax-level
or even text-level information, such as changed lines of code, and therefore
lack robustness. Furthermore, some approaches for identifying core developers
provide only a qualitative understanding without a quantitative score, or depend
on project-specific parameters, which makes them impractical for real-world
projects. To this end, we propose CValue, a multidimensional information
fusion-based approach to measure developer contributions. CValue extracts both
syntax and semantic information from the source code changes in four
dimensions: modification amount, understandability, inter-function and
intra-function impact of modification. It fuses the information to produce the
contribution score for each of the commits in the projects. Experimental
results show that CValue outperforms other approaches by 19.59% on 10
real-world projects with manually labeled ground truth. We validated that the
performance of CValue, which takes 83.39 seconds per commit, is acceptable for
application in real-world projects. Furthermore, we performed a
large-scale experiment on 174 projects and detected 2,282 developers having
inflated commits. Of these, 2,050 developers did not make any syntax
contribution, and 103 were identified as bots.
Cross-Lingual Adaptation for Type Inference
Deep learning-based techniques have been widely applied to program analysis
tasks in fields such as type inference, fault localization, and code
summarization. Hitherto, deep learning-based software engineering systems have
relied heavily on supervised learning approaches, which require laborious manual
effort to collect and label a prohibitively large amount of data. However, most
Turing-complete imperative languages share similar control- and data-flow
structures, which makes it possible to transfer knowledge learned from one
language to another. In this paper, we propose cross-lingual adaptation of
program analysis, which allows us to leverage prior knowledge learned from the
labeled dataset of one language and transfer it to the others. Specifically, we
implemented a cross-lingual adaptation framework, PLATO, to transfer a deep
learning-based type inference procedure across weakly typed languages, e.g.,
Python to JavaScript and vice versa. PLATO incorporates a novel joint graph
kernelized attention based on abstract syntax tree and control flow graph, and
applies anchor word augmentation across different languages. Besides, by
leveraging data from strongly typed languages, PLATO improves the perplexity of
the backbone cross-programming-language model and the performance of downstream
cross-lingual transfer for type inference. Experimental results illustrate that
our framework improves transferability over the baseline method by a large
margin.
An Empirical Study of Malicious Code In PyPI Ecosystem
PyPI provides a convenient and accessible package management platform to
developers, enabling them to quickly implement specific functions and improve
work efficiency. However, the rapid development of the PyPI ecosystem has led
to a severe problem of malicious package propagation. Malicious developers
disguise malicious packages as normal, posing a significant security risk to
end-users.
To this end, we conducted an empirical study to understand the
characteristics and current state of the malicious code lifecycle in the PyPI
ecosystem. We first built an automated data collection framework and collated a
multi-source malicious code dataset containing 4,669 malicious package files.
We preliminarily classified this malicious code into five categories based on
malicious behaviour characteristics. Our research found that over 50% of
malicious code exhibits multiple malicious behaviours, with information
stealing and command execution being particularly prevalent. In addition, we
observed several novel attack vectors and anti-detection techniques. Our
analysis revealed that 74.81% of all malicious packages successfully entered
end-user projects through source code installation, thereby increasing security
risks. A real-world investigation showed that many reported malicious packages
persist in PyPI mirror servers globally, with over 72% remaining for an
extended period after being discovered. Finally, we sketched a portrait of the
malicious code lifecycle in the PyPI ecosystem, effectively reflecting the
characteristics of malicious code at different stages. We also present some
suggested mitigations to improve the security of the Python open-source
ecosystem.
Comment: Accepted by the 38th IEEE/ACM International Conference on Automated
Software Engineering (ASE 2023)
A Survey on Automated Program Repair Techniques
With the rapid development and widespread adoption of software,
modern society increasingly relies on software systems. However, the problems
exposed by software have also come to the fore. Software defects have become a
major concern for developers. In this context, Automated Program
Repair (APR) techniques have emerged, aiming to automatically fix software
defects and reduce manual debugging work. In particular, benefiting
from the advances in deep learning, numerous learning-based APR techniques have
emerged in recent years, which also bring new opportunities for APR research.
To give researchers a quick overview of APR techniques' complete development
and future opportunities, we revisit the evolution of APR techniques and
discuss in depth the latest advances in APR research. In this paper, the
development of APR techniques is introduced in terms of four different patch
generation schemes: search-based, constraint-based, template-based, and
learning-based. Moreover, we propose a uniform set of criteria to review and
compare each APR tool, summarize the advantages and disadvantages of APR
techniques, and discuss the current state of APR development. Furthermore, we
introduce the research on the related technical areas of APR that have also
provided a strong motivation to advance APR development. Finally, we analyze
current challenges and future directions, especially highlighting the critical
opportunities that large language models bring to APR research.
Comment: This paper's earlier version was submitted to CSUR in August 202
Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects
With the increasing disclosure of vulnerabilities in open-source software,
software composition analysis (SCA) has been widely applied to reveal
third-party libraries and the associated vulnerabilities in software projects.
Beyond the revelation, SCA tools adopt various remediation strategies to fix
vulnerabilities, the quality of which varies substantially. However,
ineffective remediation could induce side effects, such as compilation
failures, which impede acceptance by users. According to our studies, existing
SCA tools cannot correctly handle the concerns of users regarding the
compatibility of remediated projects. To this end, we propose Compatible
Remediation of Third-party libraries (CORAL) for Maven projects to fix
vulnerabilities without breaking the projects. The evaluation showed that CORAL
not only fixed 87.56% of vulnerabilities, outperforming other tools (best
75.32%), but also achieved a 98.67% successful compilation rate and a 92.96%
successful unit test rate. Furthermore, we found that 78.45% of vulnerabilities
in popular Maven projects could be fixed without breaking the compilation, and
the rest of the vulnerabilities (21.55%) could either be fixed only by upgrades
that break compilation or could not be fixed by upgrading at all.
Comment: 11 pages, conference
Towards Understanding Third-party Library Dependency in C/C++ Ecosystem
Third-party libraries (TPLs) are frequently reused in software to reduce
development cost and the time to market. However, external library dependencies
may introduce vulnerabilities into host applications. The issue of library
dependency has received considerable critical attention. Many package managers,
such as Maven, Pip, and NPM, have been proposed to manage TPLs. Moreover, a
significant amount of effort has been put into studying dependencies in
language ecosystems such as Java, Python, and JavaScript, but not C/C++. Due to
the lack of a unified package manager for C/C++, existing research has only
limited understanding of TPL dependencies in the C/C++ ecosystem, especially at large
scale.
Towards understanding TPL dependencies in the C/C++ ecosystem, we collect
existing TPL databases, package management tools, and dependency detection
tools, summarize the dependency patterns of C/C++ projects, and construct a
comprehensive and precise C/C++ dependency detector. Using our detector, we
extract dependencies from a large-scale database containing 24K C/C++
repositories from GitHub. Based on the extracted dependencies, we provide the
results and findings of an empirical study, which aims at understanding the
characteristics of the TPL dependencies. We further discuss the implications for
managing dependencies in C/C++ and future research directions for software
engineering researchers and developers in the fields of library development,
software composition analysis, and C/C++ package managers.
Comment: ASE 202
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential
but also introduce challenges related to content constraints and potential
misuse. Our study investigates three key research questions: (1) the number of
different prompt types that can jailbreak LLMs, (2) the effectiveness of
jailbreak prompts in circumventing LLM constraints, and (3) the resilience of
ChatGPT against these jailbreak prompts. Initially, we develop a classification
model to analyze the distribution of existing prompts, identifying ten distinct
patterns and three categories of jailbreak prompts. Subsequently, we assess the
jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a
dataset of 3,120 jailbreak questions across eight prohibited scenarios.
Finally, we evaluate the resistance of ChatGPT against jailbreak prompts,
finding that the prompts can consistently evade the restrictions in 40 use-case
scenarios. The study underscores the importance of prompt structures in
jailbreaking LLMs and discusses the challenges of robust jailbreak prompt
generation and prevention.
Machine learning methods for Android malware detection
With Android mobile devices becoming increasingly popular, the Android application market has become a main target of malware attacks. Therefore, many methods have been used to protect mobile application users from being attacked. However, those methods cannot detect malware within a short time and can be easily bypassed. To detect malware before installation and overcome the drawbacks of dynamic analysis and signature-based analysis, machine learning-based malware detection methods have been proposed. In this project, I adopted this approach to develop a tool that extracts Android application features, and built a classification model using the generated feature sets. The results show that the classification model can reach 98% accuracy in predicting the maliciousness of an application. I also generated transformation attack samples, which will be used in further machine learning-based malware detection studies.
Bachelor of Engineering (Computer Science)